Direct segmentation of brain white matter tracts in diffusion MRI

The brain white matter consists of a set of tracts that connect distinct regions of the brain. Segmentation of these tracts is often needed for clinical and research studies. Diffusion-weighted MRI offers unique contrast to delineate these tracts. However, existing segmentation methods rely on intermediate computations such as tractography or estimation of fiber orientation density. These intermediate computations, in turn, entail complex computations that can result in unnecessary errors. Moreover, these intermediate computations often require dense multi-shell measurements that are unavailable in many clinical and research applications. As a result, current methods suffer from low accuracy and poor generalizability. Here, we propose a new deep learning method that segments these tracts directly from the diffusion MRI data, thereby sidestepping the intermediate computation errors. Our experiments show that this method can achieve segmentation accuracy that is on par with the state of the art methods (mean Dice Similarity Coefficient of 0.826). Compared with the state of the art, our method offers far superior generalizability to undersampled data that are typical of clinical studies and to data obtained with different acquisition protocols. Moreover, we propose a new method for detecting inaccurate segmentations and show that it is more accurate than standard methods that are based on estimation uncertainty quantification. The new methods can serve many critically important clinical and scientific applications that require accurate and reliable non-invasive segmentation of white matter tracts.


Introduction
The brain white matter is organized into a set of distinct tracts. These tracts are bundles of myelinated axons that connect different brain regions such as the cerebral cortex and the deep gray matter. Although they are tightly packed and often cross one another, each tract has an entirely different function and connects different regions of the brain [32,36]. Accurate segmentation of these tracts is needed in clinical studies and medical research. For example, in surgical planning one needs to know the precise extent of the individual tracts in order to assess the risk of damage to specific neurocognitive functions that may result from surgical removal of brain tissue. As another prominent example, changes in the micro-structural properties of different tracts is commonly used in studying brain development and disorders.
Magnetic resonance imaging (MRI) is the modality of choice for non-invasive assessment of white matter tracts in vivo. Although some of the tracts may be identifiable on T1, T2, or FLAIR images [36], accurate segmentation of most tracts is only possible with diffusion MRI. Individual tracts may be extracted from whole-brain tractograms by specifying inclusion and exclusion regions of interest (ROIs). This process, which is usually referred to as "virtual dissection", is time-consuming, subjective, and it has low reproducibility [24]. Some prior works have aimed at automating the virtual dissection process by learning to compute the inclusion/exclusion ROIs [28,37]. It is also possible to extract the tracts from a whole-brain tractogram by grouping similar streamlines using a clustering approach. This can be done by comparing individual streamlines with a predefined set of tracts in an atlas [7,13]. Some techniques additionally take into account the location of the streamlines relative to anatomical landmarks in the brain [25,26]. Tractography-based methods are inherently limited by the errors in streamline tractography [14]. To avoid these errors, some methods segment the tracts on diffusion tensor or fiber orientation images, thereby avoiding the tractography. Some of the segmentation techniques that have been explored in the past include Markov Random Fields [3], k-nearest neighbors technique [19], template matching [6], and more recently deep learning [5,34]. However, none of these intermediate parameters (e.g., the diffusion tensor) have an unambiguous biophysical meaning and their computation entails unavoidable estimation errors. Moreover, the intermediate computations for most existing methods assume availability of dense multi-shell diffusion MRI measurements, which are not acquired in many clinical and research applications. As a result, existing methods have low accuracy and limited generalizability when applied to typical clinical scans.
In this work, we develop and validate a new method that segments white matter tracts directly from the diffusion MRI data. The new method does not require tractography or computation of other intermediate parameters. Moreover, we present a simple but effective technique for detecting less accurate segmentations. We show that the new methods achieve superior accuracy and generalizability compared with the existing methods.

Segmentation model
Our method, shown schematically in Figure 1, is based on a fully convolutional network (FCN). The network architecture is similar to nnU-Net (we refer to [9] for the details of the architecture). Our method predicts tract segmentations directly from the diffusion MRI data. To enhance the generalizability of the method and to enable it to work with scans acquired using different gradient tables (i.e., different gradient strengths and/or different gradient directions): (i) We train the model with measurements that are typically acquired for diffusion tensor imaging (DTI). DTI-style scans include single-shell measurements at a b-value of around 750-1200 s/mm 2 [10]. They are the most common acquisition in clinical and research applications. We normalize these measurements by a non-weighted (b=0) measurement. (ii) We project the normalized data onto a fixed spherical harmonics (SH) basis. We use SH bases formulations of [30] with an order 2, which results in 6 SH coefficients regardless of the number of measurements. We use these 6 coefficient maps as input to the FCN. Our approach of using the data as the model input has three advantages: (1) It eliminates the need to compute intermediate parameters (e.g., fiber orientation distribution or tractogram), thereby avoiding the associated computational errors [23,22]. If the goal is tract segmentation, there is no need to incur those errors by going through intermediate computations.
(2) It improves the generalizability of the method with respect to different acquisition schemes. If, for example, the input is the tractogram, the tract segmentation results can be significantly influenced by the tractography method that is used to compute the tractogram. Moreover, computation of intermediate parameters may demand especial measurement schemes that may be unavailable at test time. For example, methods that are based on fiber orientation distribution typically require high angular resolution measurements, which can result in a loss of accuracy if such measurements are not available [4,34].
(3) It offers a highly effective data augmentation method during both training and test/inference. Data augmentation during training improves the training of large deep learning models with limited data. It is especially common in applications such as medical imaging where labeled data are costly to obtain. Test-time data augmentation, on the other hand, can be used to improve prediction accuracy and also to estimate prediction uncertainty [1,16,17]. Our trainand test-time data augmentation strategies are explained below.
Let us denote the set of b0-normalized measurements in a scan with {x(q i )} m i=1 , where q i is the unit vector indicating the gradient direction for the i th measurement. During training, in each iteration we select a subset of size 6-12 from the m measurements {x(q j )} j∈S⊆{1,...,m} , chosen uniformly at random without replacement. We select these measurements such that the gradient directions for each pair of measurements are far apart in the q space, using an approach similar to [10,27]. We use the selected measurement subset (after projecting onto the SH basis) as input to the model. This can act as a highly effective and computationally-efficient data augmentation strategy as it presents a different view of the input to the model in each training iteration.
During inference, we use n different measurement subsets, selected similarly as in training described above, to predict n different segmentations. Let us denote the segmentation probability map for a specific tract with each of these measurement subsets as {y k } n k=1 . We compute the voxel-wise average of these predictions to obtain a final segmentation prediction, which we denote withȳ. Furthermore, we can compute a measure of disagreement between these n predictions to estimate segmentation uncertainty. Disagreement between segmentation predictions is usually quantified using metrics of volume overlap or surface distance [29]. Each of these metrics quantifies the segmentation error from a narrow perspective. Furthermore, these metrics discard the probability information by binarizing the segmentations. Recent segmentation uncertainty quantification methods have also followed a purely voxel-wise approach [15,33], which ignores the spatial distribution of the segmentation probabilities. To characterize the disagreement in a way that accounts for the complete probability distribution of the predicted segmentations, we use a method based on the Wasserstein Distance, also known as earth mover's distance (EMD) [21]. Given two probability distributions p and q defined on the same metric space, this distance is defined as EMD(p, q) = inf γ∈Γ (p,q) E (x,y)∼γ d(x, y), where d is a distance measure and Γ (p, q) is the set of joint probability distributions whose marginals are equal to p and q. Intuitively, if p and q are considered as two piles of earth, EMD is the cost of turning one into the other. Although EMD can be easily quantified for scalar variables, to the best of our knowledge there are no methods for computing EMD for probability distributions in IR 2 or IR 3 . Here, we adopt an approximation that was originally proposed in [35] for comparing multi-dimensional histograms. We demonstrate this computation for a simple 3 × 3 histogram in Figure 2. Given a pair of multi-dimensional histograms (or probability distributions), the method first unfolds the histograms as shown in the example in Figure 2 and finds a minimum distance pairing between the two. The distance between the two histograms is defined as the sum of the pair-wise distances in the pairing.
Based on this approximation, we compute the EMD between two segmentation probability maps in R 3 as EMD(p, q) = tions computed as explained above, we estimate the segmentation uncertainty as u = 1 n k EMD(y k ,ȳ).

Implementation details
The segmentation network was implemented in TensorFlow 1.6 and run on an NVIDIA GeForce GTX 1080 GPU on a Linux machine with 64 GB of memory and 20 CPU cores. The network takes 3D patches of size 96 3 voxels as input and estimate the tract segmentation map for that patch. The network input has 6 channels as described above. The network output has 41 channels for the 41 tracts considered in this work. A complete description of these tracts can be found in [34]. We merged the left and right sections of bilateral tracts, such as arcuate fasciculus, into one label. We trained the network to maximize the Dice similarity coefficient (DSC) between the predicted and ground-truth segmentation of the tracts using Adam [12] with a batch size of 1 and a learning rate of 10 −4 , which was reduced by half if after a training epoch the validation loss did not decrease. We compare our method with TractSeg [34]. TractSeg was shown to be vastly superior to many tractography dissection methods [34]. Therefore, we do not compare with those methods.

Experimental Results and Discussion
We applied the method on 105 subjects in the Human Connectome Project study [8,31]. Manual segmentations of 72 tracts for these subjects are publicly available [34]. We followed a five-fold cross-validation approach, each time leaving 21 subjects for test and training on the remaining 84 subjects. Table 1 summarizes the performance of the proposed method and TractSeg. We report DSC, 95 percentile of the Hausdorff Distance (HD95), and average symmetric surface distance (ASSD). In addition to TractSeg, we compare our method with atlasbased segmentation (MAS), whereby 20 training images are registered to the test subject and the registration transforms are used to warp the segmentation labels from the training images to the test image. Voxel-wise averaging is then used to estimate the segmentations for the test image. We implemented this in two ways: MAS-FA, where we computed the registrations based on fractional  [2], and MAS-FOD, where we computed the registrations based on fiber orientation density images using mrregister [18]. Segmentation performance results are presented in Table 1. Figure 3 shows example tract segmentations predicted by our method and TractSeg. Our method using only the DTI measurements (b=1000) achieved segmentation accuracy that was very close to TractSeg using the multi-shell data with three times as many measurements. Paired t-tests did not show any significant differences (at p = 0.01) between our method and TractSeg in terms of any of the three criteria. When TractSeg was applied on the b=1000 measurements, its performance was worse than our method in terms of all three criteria. To simulate under-sampled clinical scans, we selected 6 of the b=1000 measurements as proposed in [10,27]. As shown in Table 1, the performance of our method remained almost unchanged, whereas the performance of TractSeg deteriorated significantly. Paired t-tests with a p = 0.01 threshold showed that (1) the performance of our method did not change in terms of any of the three criteria on any of the 41 tracts when 6 measurements were used compared with 90 measurements. (2) Our method achieved significantly higher DSC and significantly lower HD95 and ASSD (all with p < 0.01) with both 90 and six measurements compared with the other three methods. As shown in Figure 3, segmentations produced by our method are almost indistinguishable between 90 and 6 measurements. Although we cannot present the segmentation results for all tracts, Table 2 shows the mean DSC for six of the tracts, including anterior commissure and fornix which were the two most difficult tract to segment for our method and for TractSeg.
We further tested our method on scans of children between 2-8 years of age from an independent dataset [20]. Each scan in this dataset included 30 measurements in a single shell at b=750. We chose six measurements as input to our model as described above. We manually extracted 32 tracts from 12 different subjects on this dataset. Our method achieved DSC, HD95, and ASSD of 0.786 ± 0.076, 2.85 ± 1.20, and 1.017 ± 0.291, respectively. Although this shows a drop in accuracy, it is a highly encouraging result given the fact that this was a completely independent test dataset that was different from our train-   (21-36 years), and (2) measurement b-value of 750 versus 1000. Compared with our method, TractSeg failed on this dataset, completely missing most of the tracts and achieving a mean DSC of 0.070. To further evaluate the reproducibility of our method on this dataset, we selected two disjoint subsets of six measurements from each scan and applied our method to segment the tracts. We computed the DSC between the tracts computed with the two measurement subsets. We did this for 100 scans, each from a different subject. The DSC for our method was 0.867 ± 0.041, whereas it was 0.115 ± 0.109 for TractSeg. Example results for our method on this dataset are shown in Figure 4. Figure 5 shows a plot of our proposed segmentation uncertainty, u, versus accuracy in terms of DSC. It shows that u is highly effective in identifying the less accurate segmentations. If we choose segmentations with a DSC of 0.70 and lower to be inaccurate, with a threshold of u = 0.30 we can detect such segmentation with sensitivity=0.86, specificity=0.92, and accuracy=0.91. In Table 6 we compare method with the two standard methods based on estimation segmentation uncertainty: dropout, and ensemble methods. We refer to [15] for a description of these methods. Our method achieves overall better results. Note that the ensemble method requires training of multiple models. We trained 10 models in this experiment, which increased the training time by a factor of 10.

Computational time and other experiments
Training time for our method is approximately 24 hours. Our method segments a test image in 2.4 seconds. TractSeg requires approximately 60 seconds to segment an image. MAS methods require much longer time, approximately 3 minutes for MAS-FA and 12 minutes for MAS-FOD. In recent years attention-based vision models have become very common in medical image segmentation. To experiment with one such model, we applied the model of [11], which has been developed specifically for 3D medical image segmentation. This model achieved a DSC of 0.740 ± 0.125, which was far lower segmentation performance those reported above.

Conclusions
Our method shows great promise in segmenting various white matter tracts. The appeal of our method is twofold: (1) Superior accuracy on under-sampled data that are typical of clinical scans, as clearly demonstrated by our results in Figure